
Installing faiss on Ubuntu and serving a vector search API with Flask

Preface: a brief introduction to faiss

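In early March, Facebook AI Research (FAIR) open-sourced a library called Faiss. Faiss is mainly used for efficient similarity search and clustering of dense vectors, and it contains algorithms that search in sets of vectors of any size, including sets too large to fit in RAM. The library is written mostly in C++, with optional GPU support through CUDA and an optional Python interface.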

With Faiss, a single query against a database of one billion images takes only 17.7 microseconds, 8.5 times faster than before, with better accuracy as well.

Github : https://github.com/facebookresearch/faiss
Wiki : https://github.com/facebookresearch/faiss/wiki

Follow the official INSTALL.md to start the installation.

Installing Conda

According to the official documentation, the easiest way to install FAISS is through anaconda, and stable releases are pushed to conda regularly.

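  1. What is Conda?
    conda is a package and environment manager, and Anaconda is a Python distribution for scientific computing built around it. Depending on whether it targets Python 2 or Python 3, the distribution comes as Anaconda2 or Anaconda3. The official website offers the latest version.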

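  2. Download and install Conda
    Following the official installation documentation, first download Anaconda, then start the installer with the commands below and confirm the settings as prompted on screen. The installation finishes after a short wait.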

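  • Note in particular: once the computing environment is installed, the installer asks whether to add the anaconda install path to your environment variables. Answer yes.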
$ wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-5.2.0-Linux-x86_64.sh

$ bash Anaconda3-5.2.0-Linux-x86_64.sh

Welcome to Anaconda2 5.2.0

In order to continue the installation process, please review the license
agreement.
Please, press ENTER to continue
>>>
===================================
Anaconda End User License Agreement
===================================

Copyright 2015, Anaconda, Inc.

All rights reserved under the 3-clause BSD License:

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
* Neither the name of Anaconda, Inc. ("Anaconda, Inc.") nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL ANACONDA, INC. BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Notice of Third Party Software Licenses
=======================================

Anaconda Distribution contains open source software packages from third parties. These are available on an "as is" basis and subject to their individual license agreements. These licenses are available in Anaconda Distribution or at http://docs.anaconda.com/anaconda/pkg-docs. Any binary packages of these third party tools you obtain via Anaconda Distribution are subject to their individual licenses as well as the Anaconda license. Anaconda, Inc. reserves the right to change which third party tools are provided in Anaconda Distribution.

In particular, Anaconda Distribution contains re-distributable, run-time, shared-library files from the Intel(TM) Math Kernel Library ("MKL binaries"). You are specifically authorized to use the MKL binaries with your installation of Anaconda Distribution. You are also authorized to redistribute the MKL binaries with Anaconda Distribution or in the conda package that contains them. Use and redistribution of the MKL binaries are subject to the licensing terms located at https://software.intel.com/en-us/license/intel-simplified-software-license. If needed, instructions for removing the MKL binaries after installation of Anaconda Distribution are available at http://www.anaconda.com.

Anaconda Distribution also contains cuDNN software binaries from NVIDIA Corporation ("cuDNN binaries"). You are specifically authorized to use the cuDNN binaries with your installation of Anaconda Distribution. You are also authorized to redistribute the cuDNN binaries with an Anaconda Distribution package that contains them. If needed, instructions for removing the cuDNN binaries after installation of Anaconda Distribution are available at http://www.anaconda.com.


Anaconda Distribution also contains Visual Studio Code software binaries from Microsoft Corporation ("VS Code"). You are specifically authorized to use VS Code with your installation of Anaconda Distribution. Use of VS Code is subject to the licensing terms located at https://code.visualstudio.com/License.

Cryptography Notice
===================

This distribution includes cryptographic software. The country in which you currently reside may have restrictions on the import, possession, use, and/or re-export to another country, of encryption software. BEFORE using any encryption software, please check your country's laws, regulations and policies concerning the import, possession, or use, and re-export of encryption software, to see if this is permitted. See the Wassenaar Arrangement http://www.wassenaar.org/ for more information.

Anaconda, Inc. has self-classified this software as Export Commodity Control Number (ECCN) 5D992b, which includes mass market information security software using or performing cryptographic functions with asymmetric algorithms. No license is required for export of this software to non-embargoed countries. In addition, the Intel(TM) Math Kernel Library contained in Anaconda, Inc.'s software is classified by Intel(TM) as ECCN 5D992b with no license required for export to non-embargoed countries and Microsoft's Visual Studio Code software is classified by Microsoft as ECCN 5D992.c with no license required for export to non-embargoed countries.

The following packages are included in this distribution that relate to cryptography:

openssl
The OpenSSL Project is a collaborative effort to develop a robust, commercial-grade, full-featured, and Open Source toolkit implementing the Transport Layer Security (TLS) and Secure Sockets Layer (SSL) protocols as well as a full-strength general purpose cryptography library.

pycrypto
A collection of both secure hash functions (such as SHA256 and RIPEMD160), and various encryption algorithms (AES, DES, RSA, ElGamal, etc.).

pyopenssl
A thin Python wrapper around (a subset of) the OpenSSL library.

kerberos (krb5, non-Windows platforms)
A network authentication protocol designed to provide strong authentication for client/server applications by using secret-key cryptography.

cryptography
A Python library which exposes cryptographic recipes and primitives.


Do you accept the license terms? [yes|no]
[no] >>>
Please answer 'yes' or 'no':'
>>> yes

Anaconda2 will now be installed into this location:
/root/anaconda2

- Press ENTER to confirm the location
- Press CTRL-C to abort the installation
- Or specify a different location below

[/root/anaconda2] >>>
PREFIX=/root/anaconda2
installing: python-2.7.15-h1571d57_0 ...
Python 2.7.15 :: Anaconda, Inc.
installing: blas-1.0-mkl ...

...

installation finished.
Do you wish the installer to prepend the Anaconda2 install location
to PATH in your /root/.bashrc ? [yes|no]
[no] >>> yes

Appending source /root/anaconda2/bin/activate to /root/.bashrc
A backup will be made to: /root/.bashrc-anaconda2.bak


For this change to become active, you have to open a new terminal.

Thank you for installing Anaconda2!

===========================================================================

Anaconda is partnered with Microsoft! Microsoft VSCode is a streamlined
code editor with support for development operations like debugging, task
running and version control.

To install Visual Studio Code, you will need:
- Administrator Privileges
- Internet connectivity

Visual Studio Code License: https://code.visualstudio.com/license

Do you wish to proceed with the installation of Microsoft VSCode? [yes|no]
>>> no
$

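If you did not answer yes, you have to configure the environment yourself. As the installer suggests, run the following in a terminal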

sudo vim /etc/profile

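to open the profile file. Append the line export PATH=/root/anaconda3/bin:$PATH at the end (substitute your actual install prefix; the installer log above shows /root/anaconda2), then save and exit.
Finally, run source /etc/profile to apply the change.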

  3. Test whether Conda was installed successfully
    In a terminal, run
conda list

If this prints the list of packages installed by conda, Conda was installed successfully.

Then try it from Python as well:

Python 2.7.15 |Anaconda, Inc.| (default, May  1 2018, 23:32:55)
[GCC 7.2.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import scipy

If the banner mentions Anaconda and the import raises no error, the Anaconda installation is complete.
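The same two checks can also be scripted. The following is a minimal sketch, assuming a Python 3 interpreter; the helper names conda_on_path and module_available are made up for this illustration and are not part of conda or Anaconda.

```python
import importlib.util
import shutil
import sys


def conda_on_path():
    """Return the full path of the conda executable, or None if it is not on PATH."""
    return shutil.which("conda")


def module_available(name):
    """Return True if the top-level module `name` can be found by this interpreter."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Which interpreter is running tells you whether the Anaconda one is picked up.
    print("interpreter:", sys.executable)
    print("conda on PATH:", conda_on_path())
    print("scipy importable:", module_available("scipy"))
```

If conda is reported as None, the PATH change described above has most likely not taken effect in the current shell yet.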

Installing OpenBLAS

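BLAS stands for Basic Linear Algebra Subprograms. It mainly covers matrix-matrix, matrix-vector, and vector-vector operations, and is one of the foundational math libraries of scientific and engineering computing.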
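To make the three kinds of operation concrete, here is a small sketch using NumPy, which typically hands these products to whichever BLAS it was built against (possibly OpenBLAS or MKL). The array sizes are arbitrary, and the routine names in the comments are the conventional single-precision BLAS names, given only for orientation.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((3, 4)).astype("float32")
B = rng.random((4, 2)).astype("float32")
x = rng.random(4).astype("float32")
y = rng.random(4).astype("float32")

dot_xy = x @ y  # vector-vector product (BLAS level 1, e.g. sdot)
Ax = A @ x      # matrix-vector product (BLAS level 2, e.g. sgemv)
AB = A @ B      # matrix-matrix product (BLAS level 3, e.g. sgemm)

print(dot_xy, Ax.shape, AB.shape)
```

Similarity search leans heavily on the matrix-matrix case, since the distances between many queries and many database vectors can be derived from one large product.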

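OpenBLAS is an open-source project that grew out of GotoBLAS, starting around 2011. At the time of writing its stable version was 0.2.14; it had only three core developers but as many as 44 contributors. It has entered the repositories of mainstream Linux distributions and become a key dependency for MIT and for others such as GNU and DL.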

  1. Install openblas through conda
conda install openblas
  2. Create a symbolic link
ln -s $HOME/anaconda3/lib/libopenblas.so.0 /usr/lib/libopenblas.so.0

Installing faiss

To build and install from source, refer to the official documentation.

  1. Install faiss through Conda
    I installed only the CPU version here; the GPU version requires CUDA to be installed first. Use the commands below.
# CPU version only
conda install faiss-cpu -c pytorch

# GPU version
# Make sure you have CUDA installed before installing faiss-gpu,
# otherwise it falls back to CPU version
conda install faiss-gpu -c pytorch # [DEFAULT]For CUDA8.0, comes with cudatoolkit8.0
conda install faiss-gpu cuda90 -c pytorch # For CUDA9.0
conda install faiss-gpu cuda91 -c pytorch # For CUDA9.1
# cuda90/cuda91 shown above is a feature, it doesn't install CUDA for you.
  2. Test faiss
    Open a Python interpreter in a terminal and try import faiss; if no error is raised, the installation is complete.
$ python
Python 2.7.15 |Anaconda, Inc.| (default, May 1 2018, 23:32:55)
[GCC 7.2.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import faiss
  3. Run the demo
    The official Python demo:
# Copyright (c) 2015-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the BSD+Patents license found in the
# LICENSE file in the root directory of this source tree.

import numpy as np

d = 64 # dimension
nb = 100000 # database size
nq = 10000 # nb of queries
np.random.seed(1234) # make reproducible
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.
xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000.

import faiss # make faiss available
index = faiss.IndexFlatL2(d) # build the index
print(index.is_trained)
index.add(xb) # add vectors to the index
print(index.ntotal)

k = 4 # we want to see 4 nearest neighbors
D, I = index.search(xb[:5], k) # sanity check
print(I)
print(D)
D, I = index.search(xq, k) # actual search
print(I[:5]) # neighbors of the 5 first queries
print(I[-5:]) # neighbors of the 5 last queries
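For intuition about what the demo computes: IndexFlatL2 performs an exhaustive search and reports squared Euclidean distances. The sketch below reproduces that behaviour with plain NumPy so it runs without faiss. It is an illustration under stated assumptions, not how Faiss is implemented: the function name flat_l2_search is made up here, and the database is shrunk to 1000 vectors to keep it quick.

```python
import numpy as np


def flat_l2_search(xb, xq, k):
    """Exact k-nearest-neighbour search by squared L2 distance (brute force)."""
    # ||q - b||^2 = ||q||^2 - 2 q.b + ||b||^2, evaluated for all pairs at once
    q_norms = (xq ** 2).sum(axis=1, keepdims=True)
    b_norms = (xb ** 2).sum(axis=1)
    dist = q_norms - 2.0 * (xq @ xb.T) + b_norms
    idx = np.argsort(dist, axis=1)[:, :k]  # k smallest distances per query
    return np.take_along_axis(dist, idx, axis=1), idx


d, nb = 64, 1000  # smaller database than the demo (assumption, for speed)
np.random.seed(1234)
xb = np.random.random((nb, d)).astype("float32")
xb[:, 0] += np.arange(nb) / 1000.0

# Same sanity check as the demo: search the first five database vectors.
D, I = flat_l2_search(xb, xb[:5], k=4)
print(I)
print(D)
```

As in the demo's sanity check, each query's first neighbour should be the query itself, at a distance of roughly zero up to float32 rounding.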

Building the vector search API

What is Faiss?

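Faiss can be thought of as a database on which you can set up indexes.
What is an index for? Faster reads. What is a database for? Create, read, update, and delete. What does a database store? Usually many records, but in the case of Faiss it is a huge number of vectors.
The difference is that Faiss has no notion of a database storage layer; everything is an Index.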

What role does an Index play in Faiss?

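Continuing the database-index analogy: to look data up faster, we can index by first letter as a dictionary does, or use an inverted index as early Google did.
Different index types have different strengths and weaknesses, and Faiss already implements them all. If you only want to use it, you can set the implementation details aside for now and just learn the characteristics of each index and your own use case.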
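To illustrate the inverted-index idea applied to vectors, here is a toy sketch. It is not Faiss code: the class ToyIVF and its methods are made up for this example, and the centroids are fixed by hand, whereas a real inverted-file index would learn them from training data. Each vector is filed under its nearest centroid, and a query scans only the nprobe closest lists.

```python
import numpy as np


class ToyIVF:
    """Toy inverted-file index: vectors are bucketed by their nearest centroid."""

    def __init__(self, centroids):
        self.centroids = np.asarray(centroids, dtype="float32")
        self.lists = {c: [] for c in range(len(self.centroids))}
        self.vectors = []

    def _nearest_centroids(self, x, n):
        dist = ((self.centroids - x) ** 2).sum(axis=1)
        return np.argsort(dist)[:n]

    def add(self, x):
        x = np.asarray(x, dtype="float32")
        vec_id = len(self.vectors)
        self.vectors.append(x)
        # File the vector id under the single closest centroid.
        self.lists[int(self._nearest_centroids(x, 1)[0])].append(vec_id)
        return vec_id

    def search(self, q, k, nprobe=1):
        q = np.asarray(q, dtype="float32")
        # Only the nprobe closest inverted lists are scanned.
        candidates = [i for c in self._nearest_centroids(q, nprobe)
                      for i in self.lists[int(c)]]
        scored = sorted((float(((self.vectors[i] - q) ** 2).sum()), i)
                        for i in candidates)
        return scored[:k]  # list of (squared distance, vector id)


index = ToyIVF(centroids=[[0.0, 0.0], [10.0, 10.0]])
for v in [[0.1, 0.2], [0.3, 0.1], [9.8, 10.1], [10.2, 9.9]]:
    index.add(v)
hits = index.search([0.0, 0.0], k=2)
print(hits)
```

The trade-off shows up immediately: with nprobe=1 the query never looks at the second list, so a request for four neighbours returns only two. Scanning more lists recovers them at the cost of more distance computations.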

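The figure above mainly shows the flow of an API that uses faiss for search. Unlike the usual ways of working with data (relational databases, non-relational databases, graph databases, and so on), faiss is only a fairly bare-bones open-source library and does not provide a complete solution. By analogy, it is like the lucene package inside elasticsearch and solr, and my job is to build on top of it and provide a usable solution. Since faiss is a C++ library whose only other interface is Python, Python is the only practical choice for developing this API. After some evaluation I settled on flask + uwsgi + faiss as the stack.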

Environment dependencies

  • the faiss library
  • python3.6
  • uwsgi
  • pycharm

Reference

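This gets its own heading to thank plippe for faiss-web-service. His open-source demo covered more than 60% of my needs, and I could only approach this development with such confidence after reading and understanding his code.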

How to develop the Python API

To develop a web API with flask, start with the main entry point:

from flask import Flask
from faiss_index import blueprint as FaissIndexBlueprint

app = Flask(__name__)


app.config.from_pyfile('config.py')

app.register_blueprint(FaissIndexBlueprint.blueprint)


if __name__ == '__main__':
    app.run()

Next is the module (Blueprint) registered in the main entry point:

from jsonschema import validate, ValidationError
from flask import Blueprint, jsonify, request
from werkzeug.exceptions import BadRequest
from faiss_index import FaissIndex
import json

try:
    import uwsgi
except ImportError:
    print('Failed to load python module uwsgi')
    print('Periodic faiss index updates isn\'t enabled')

    uwsgi = None

blueprint = Blueprint('faiss_index', __name__)

@blueprint.route('/ping')
def ping():
    return "pong"

OK, running app.py directly starts the app on flask's built-in WSGI server.
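The /ping route can also be exercised without starting a server, using flask's test client. The sketch below folds the entry point and the blueprint into a single file so that it is self-contained. The create_app factory is a name chosen for this example, and loading config.py is left out because no such file accompanies the sketch.

```python
from flask import Blueprint, Flask

blueprint = Blueprint("faiss_index", __name__)


@blueprint.route("/ping")
def ping():
    return "pong"


def create_app():
    """Build the application and register the blueprint (hypothetical factory)."""
    app = Flask(__name__)
    app.register_blueprint(blueprint)
    return app


app = create_app()
with app.test_client() as client:
    response = client.get("/ping")
    status = response.status_code
    body = response.get_data(as_text=True)

print(status, body)
```

A check like this is handy inside a Docker build or a CI job, where opening a port just to confirm that the routes are wired up would be awkward.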

Search parameters

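The caller of the API supplies a one-dimensional vector, and the goal is to find the K nearest vectors. How is a vector represented in the program? As a one-dimensional array, for example [1,2,3,4,5]. The vector cannot be searched as soon as it is received, though; it first has to be converted as follows: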

vectors = [np.array(vectors, dtype=np.float32)]
vectors = np.atleast_2d(vectors)
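The reason for this conversion is that the Python search call expects a two-dimensional float32 array of shape (n, d), one row per query. The sketch below wraps the same idea in a helper with a dimension check. The name to_query_matrix and its dim parameter are assumptions made for illustration, and unlike the two lines above it also accepts a batch of vectors.

```python
import numpy as np


def to_query_matrix(vectors, dim):
    """Convert a JSON-style list into an (n, dim) float32 array for searching."""
    arr = np.atleast_2d(np.asarray(vectors, dtype=np.float32))
    if arr.ndim != 2 or arr.shape[1] != dim:
        raise ValueError(
            "expected vectors of dimension %d, got shape %r" % (dim, arr.shape)
        )
    # Return a C-contiguous array, the memory layout the index expects.
    return np.ascontiguousarray(arr)


single = to_query_matrix([1, 2, 3, 4, 5], dim=5)
batch = to_query_matrix([[1, 2, 3, 4, 5], [5, 4, 3, 2, 1]], dim=5)
print(single.shape, single.dtype)
print(batch.shape, batch.dtype)
```

Rejecting a wrong dimension up front lets the API answer with a clear client error instead of failing somewhere inside the search call.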

Deployment: Docker

Compared with other python web APIs, faiss search has one peculiarity: its most important dependency is faiss itself. There are two ways to install faiss:

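  • Build from the downloaded source code: very tedious
  • Install through anaconda: a single command

But the server may already have other python environments running, and I cannot reasonably change them just to deploy my project. Perhaps this could also be solved with an env? I am not familiar enough with python, so I chose the isolation method I know best, Docker. The idea is to start from an Ubuntu image, first build an image with the faiss runtime environment, and then package my flask application in a second image on top of it. Here is the Dockerfile that builds the runtime environment: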
ARG IMAGE
FROM ${IMAGE}

ARG FAISS_CPU_OR_GPU
ARG FAISS_VERSION

RUN apt-get update && \
    apt-get install -y curl bzip2 && \
    curl https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh > /tmp/conda.sh && \
    bash /tmp/conda.sh -b -p /opt/conda && \
    /opt/conda/bin/conda update -n base conda && \
    /opt/conda/bin/conda install -y -c pytorch faiss-${FAISS_CPU_OR_GPU}=${FAISS_VERSION} && \
    apt-get remove -y --auto-remove curl bzip2 && \
    apt-get clean && \
    rm -fr /tmp/conda.sh

ENV PATH="/opt/conda/bin:${PATH}"

I also pushed the resulting image to dockerhub as faiss-docker.
You can search for my image with docker and pull it:

docker search huangqq
docker pull huangqq/faiss-docker:1.2.1-cpu